Changed stream format to make compatible with MiGz#2

Closed
cielavenir wants to merge 1 commit into vinlyx:master from cielavenir:migz_compatibility

Conversation

@cielavenir
Contributor

First, I should mention that the -8 at https://github.com/vinlyx/mgzip/blob/0.1.0/mgzip/multiProcGzip.py#L492 corresponds to the +8 at https://github.com/vinlyx/mgzip/blob/0.1.0/mgzip/multiProcGzip.py#L316, which accounts for the CRC32 plus the block size. You have actually already figured out the -8.
Also, for an indexed gzip you should not need a QWORD for either the member size or the block size. The latter can be read from the gzip footer.
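To illustrate the footer point, here is a minimal sketch (not code from mgzip itself) that reads the last 8 bytes of a single gzip member. Per RFC 1952 these are the CRC32 followed by ISIZE, both little-endian 32-bit values. The helper name `read_gzip_trailer` is hypothetical.

```python
import gzip
import struct
import zlib


def read_gzip_trailer(member: bytes):
    """Return (crc32, isize) taken from the final 8 bytes of one gzip member."""
    crc32, isize = struct.unpack("<II", member[-8:])
    return crc32, isize


data = b"hello mgzip" * 1000
member = gzip.compress(data)
crc32, isize = read_gzip_trailer(member)

# The trailer matches what we can compute from the raw data.
print(crc32 == zlib.crc32(data), isize == len(data))
```

Since the uncompressed size of a block is already in the footer, only the compressed size needs to be stored elsewhere to make the stream seekable.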

Now to the main topic: although your concept is great, your format is very rare, and no other tool can open it as an indexed gzip.
LinkedIn recently introduced a format named MiGz: https://github.com/linkedin/migz
It implements the points above and records only compressed_size in the extra header.
So I tried changing your code a little for interoperability with MiGz.
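As a rough sketch of the idea (not the actual mgzip or MiGz code), the following writes gzip members whose FEXTRA field carries only the compressed size, then walks the concatenated stream without decompressing anything. The subfield ID `b"MZ"`, the 4-byte little-endian size, and the helper names are assumptions made for illustration; check the MiGz source for the exact layout.

```python
import gzip
import struct
import zlib

SUBFIELD_ID = b"MZ"  # assumed subfield ID, for illustration only


def write_indexed_member(raw: bytes) -> bytes:
    """Build one gzip member whose extra field stores the deflate body size."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -zlib.MAX_WBITS)  # raw deflate
    body = comp.compress(raw) + comp.flush()
    extra = SUBFIELD_ID + struct.pack("<H", 4) + struct.pack("<I", len(body))
    # ID1, ID2, CM=8 (deflate), FLG=0x04 (FEXTRA), MTIME=0, XFL=0, OS=255
    header = struct.pack("<BBBBIBB", 0x1F, 0x8B, 8, 0x04, 0, 0, 255)
    header += struct.pack("<H", len(extra)) + extra
    trailer = struct.pack("<II", zlib.crc32(raw), len(raw) & 0xFFFFFFFF)
    return header + body + trailer


def member_offsets(stream: bytes) -> list:
    """List the start offset of each member using only the extra fields.

    Assumes every member was produced by write_indexed_member above.
    """
    offsets, pos = [], 0
    while pos < len(stream):
        offsets.append(pos)
        (xlen,) = struct.unpack_from("<H", stream, pos + 10)
        (comp_size,) = struct.unpack_from("<I", stream, pos + 16)
        pos += 10 + 2 + xlen + comp_size + 8  # header, XLEN, extra, body, trailer
    return offsets


block_a, block_b = b"a" * 5000, b"b" * 7000
m1, m2 = write_indexed_member(block_a), write_indexed_member(block_b)
stream = m1 + m2
print(member_offsets(stream))
```

Because each member is still a valid gzip member, an ordinary decoder such as `gzip.decompress(stream)` reads the whole stream, while an index-aware reader can jump straight to any block.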

By the way, note that get_index() needs to be rewritten. (I can work on it after I hear from you.)

@vinlyx
Owner

vinlyx commented Sep 24, 2019

Hi cielavenir, thank you very much for your explanation and suggestion. I have already merged your first PR #1.

Thanks for the information. I hadn't noticed LinkedIn's MiGz repository, but the idea seems to be almost the same.

I would like to explain the original reason for putting a QWORD into the extra field to record the size of the raw data:
As you may know, the gzip format has an inherent issue: when the raw file is larger than 4GB, it is impossible to get the exact uncompressed size without decompressing the file (https://bugzilla.redhat.com/show_bug.cgi?id=752040). This is caused by the original design of the gzip format in RFC 1952, which uses a 32-bit ISIZE field to record the raw file size.
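The ISIZE limitation can be shown with plain arithmetic. RFC 1952 defines ISIZE as the uncompressed size modulo 2^32, so a single member larger than 4 GiB reports a wrapped value. The sketch below only simulates the trailer field; it does not compress real data, and the helper name `isize_field` is hypothetical.

```python
import struct

GIB = 1024 ** 3


def isize_field(raw_size: int) -> bytes:
    """Pack the 4-byte ISIZE trailer field for a member of raw_size bytes."""
    return struct.pack("<I", raw_size & 0xFFFFFFFF)  # size modulo 2**32


raw_size = 5 * GIB  # a hypothetical 5 GiB member
(reported,) = struct.unpack("<I", isize_field(raw_size))

# The footer reports 1 GiB, although the member actually holds 5 GiB.
print(reported // GIB, raw_size // GIB)
```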

The primary purpose of mgzip is to provide a faster way to process large files, specifically files that may be larger than 100GB. Using the original 32-bit ISIZE to represent the raw size would potentially limit the member size to 4GB, but I want to keep open the possibility of using member sizes >4GB.

But this is open to discussion, and I am also looking at the BGZF and RAZF documentation to see whether there is a better solution.

Any comments and suggestions are welcome. Thanks!

@cielavenir
Contributor Author

Yes, but that applies only if the block size is more than 4GB. mgzip (IndexedGzip) or MiGz should not be designed that way (to be clear, the block size is what readme.md mentions as 200MB).
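To make the argument concrete, here is a small arithmetic sketch: when a large file is split into blocks well below 4 GiB, each block's 32-bit ISIZE is exact, so the total raw size is simply the sum over all members. The 100 GB file size and the helper name `split_sizes` are illustrative assumptions; the 200 MB block size is the figure mentioned above.

```python
MASK32 = 0xFFFFFFFF


def split_sizes(total: int, block: int) -> list:
    """Sizes of the blocks a file of `total` bytes is split into."""
    full, rest = divmod(total, block)
    return [block] * full + ([rest] if rest else [])


total = 100 * 10**9   # hypothetical 100 GB raw file
block = 200 * 10**6   # 200 MB blocks

sizes = split_sizes(total, block)
isizes = [s & MASK32 for s in sizes]   # what each member's footer would hold

# Per-block ISIZE values add up to the true size; a single-member ISIZE would not.
print(sum(isizes) == total, (total & MASK32) == total)
```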

By the way, I have a suite for handling such files: https://github.com/cielavenir/7bgzf/tree/dev
(Actually, I was looking for other formats and happened to find this module.)

@cielavenir
Contributor Author

moved to #6

@cielavenir cielavenir closed this Jun 6, 2020
timhughes referenced this pull request in pgzip/pgzip Sep 11, 2021